This report analyzes how a new YouTuber should choose content to upload in order to gain subscribers. We first cleaned the source data and performed a regression analysis. We then compared the accuracy of a KNN model and a decision tree model. Finally, we explored the data through visualization. Our results suggest that a new YouTuber should focus on uploading music or entertainment content in order to get more views.
For this project we used a publicly available dataset from the internet, but built our own analysis on it and focused on the questions that interest us. The primary goal is to give new YouTubers some insight so that they can adapt more easily and decide which type of video to invest in. The project is centered on four questions: 1. What does the current YouTube market look like? 2. Can we infer the views of a video from features we already know, and predict them? 3. Can we construct a model to forecast the performance of YouTube videos? 4. Which types of YouTube videos offer a relatively high return? We address each of these questions in the sections below.
We selected the data from March 1, 2018, to June 1, 2018, and added country as a feature to the dataset.
## 2.0 Regression Analysis
Here we use the available variables to predict the views of YouTube videos, which may give a prospective YouTuber a rough idea of which elements influence the final view count. We use regression analysis to test whether variables such as ‘likes’, ‘dislikes’, ‘comment_count’, ‘category_id’, ‘dif_date’ and ‘title_words’ have a significant predictive effect on views, and we look for the best linear regression model using step-wise regression.
## 3.0 KNN & Decision Trees
We fit two models and compare them to find the one that better supports our predictions.
## 4.0 Data analysis
This section consists of three parts. First, we analyzed the full dataset, using plots of the total and average views, likes, dislikes and comments for each video category, to help new YouTubers learn about the video market and its potential areas. Second, we analyzed the top 5,000 videos in our dataset and picked out their typical characteristics to understand the audience’s tastes across video types. Finally, we analyzed the titles of the top 5,000 videos, splitting them into words to find the 10 words that appear most often in popular videos.
The first step was data cleaning. The dataset we chose is very large. Although the values are relatively complete, we found that the release time of the earliest video differs between countries, which makes a direct comparison incomplete. Since the data were more than sufficient, we kept only the data from March 1, 2018, to June 1, 2018. We also coded each country and added it as a feature to the dataset to better explain the variability. We then normalized the data using the formula (x - mean(x)) / sd(x).
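The normalization formula above can be sketched in plain Python as follows. This is a minimal illustration, not the code used for the report: the function name `standardize` and the sample view counts are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator), which is also what R's `sd()` computes.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean(x)) / sd(x), using the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # sample SD with an n - 1 denominator (assumption)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mu) / sd for x in values]

# Hypothetical view counts, for illustration only.
views = [1200, 4500, 800, 150000, 9800]
z_views = standardize(views)
```

After standardization each column has mean 0 and standard deviation 1, so features measured on very different scales become comparable, which matters for distance-based models such as KNN.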
## 2.0 Regression Analysis
### 2.0.1 Correlation
We all know that the ultimate goal of a new YouTuber is to make videos popular; in other words, views are what a YouTuber cares about most. To give new YouTubers a clearer idea of the relationships between video features, we first conducted a correlation analysis. We created a correlation matrix over ‘views’, ‘category_id’, ‘likes’, ‘dislikes’, ‘comment_count’ and ‘title_words’. The results show that views have a strong positive correlation with ‘likes’ (0.78) and ‘dislikes’ (0.74), and a moderate positive correlation with ‘comment_count’ (0.54).
We used the ‘pairs’ function to get a rough picture of the relationships between the variables. The plot shows some clear patterns. For example, ‘views’ is positively correlated with ‘likes’, ‘dislikes’ and ‘comment_count’, whereas ‘dif_date’ and ‘title_words’ do not seem to explain ‘views’ well. The plot also shows that ‘views’, ‘likes’, ‘dislikes’ and ‘comment_count’ are strongly correlated with one another.
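The correlation matrix described in this subsection can be reproduced in outline with the Pearson coefficient. The sketch below is plain Python with hypothetical helper names (`pearson`, `correlation_matrix`) and made-up numbers; it only illustrates the calculation and does not reproduce the reported values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data, for illustration only.
data = {
    "views": [100, 250, 400, 800, 1600],
    "likes": [10, 20, 45, 70, 150],
    "title_words": [5, 9, 4, 8, 6],
}
corr = correlation_matrix(data)
```

Reading one row of `corr`, for example `corr["views"]`, gives the correlation of views with every other variable, which is the comparison made in the text.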
Based on the analysis above, we know that the chosen variables explain part of the response, video views. We then ran a linear regression of views on six variables (category_id, likes, dislikes, comment_count, dif_date, title_words). The adjusted R-squared is 0.757, which suggests that these variables fit views reasonably well. In addition, four of the six variables have highly significant coefficients, so it makes sense to keep them in the model: ‘likes’, ‘dislikes’, ‘comment_count’ and ‘title_words’.
Because the coefficients of ‘dif_date’ and ‘category_id’ are not significant, we removed them and refit the regression on the remaining four variables. We also ran step-wise regression to find the best-fitting model. The result is: views ~ likes + dislikes + comment_count + title_words + likes:title_words + comment_count:title_words + likes:dislikes + dislikes:comment_count + dislikes:title_words + likes:comment_count. After the step-wise regression the adjusted R-squared rises to 0.80, which means the model is useful for predicting views from the other variables. A new YouTuber can use it not only to understand which factors may influence video views, but also to estimate the views of a video given the other values.
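The selected model contains the four main effects plus all six pairwise interaction terms, and adjusted R-squared is the measure used to compare it with the smaller model. The sketch below illustrates both ideas in plain Python. The function names and the sample numbers (including the 100 observations) are hypothetical; the adjusted R-squared formula is the standard one, 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
from itertools import combinations

def with_pairwise_interactions(row):
    """Extend a dict of predictors with every pairwise product, named 'a:b'."""
    out = dict(row)
    for a, b in combinations(row, 2):
        out[f"{a}:{b}"] = row[a] * row[b]
    return out

def adjusted_r_squared(r2, n_obs, n_terms):
    """Penalize R-squared for the number of model terms (intercept excluded)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

# Hypothetical standardized predictor values for one video.
row = {"likes": 2.0, "dislikes": 3.0, "comment_count": 5.0, "title_words": 7.0}
terms = with_pairwise_interactions(row)  # 4 main effects + 6 interactions

# The same raw R-squared is worth less when it takes more terms to reach it.
adj_small = adjusted_r_squared(0.80, n_obs=100, n_terms=4)
adj_large = adjusted_r_squared(0.80, n_obs=100, n_terms=10)
```

Because the penalty grows with the number of terms, an increase in adjusted R-squared after adding interactions indicates that the extra terms explain more than their added complexity costs.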
## 3.0 KNN & Decision Trees
### 3.0.1 KNN
K-Nearest Neighbors (KNN) is one of the methods we chose, mainly because it is a simple, lazy, nonparametric algorithm. KNN can reach relatively high accuracy, often enough that a more elaborate supervised learning model is not needed, and it does not require additional assumptions, extensive parameter tuning, or an explicit model-building step. It still has several disadvantages. Its accuracy fluctuates with the quality of the data. For large datasets its prediction stage can be very slow. It is also sensitive to irrelevant features, so more effort has to go into feature screening.

#### Result

Since the dependent variable is not discrete, we cannot use a confusion matrix to show the accuracy of the model, so we calculate the RMSE, MSE and MAE of KNN instead. When K = 5, KNN has the smallest predicted RMSE (0.1540642, shown in the appendix); the corresponding MSE is 0.02373578 and the MAE is 0.4635796. As shown in the figure, the real values and the predicted values overlap closely. Combined with the RMSE above, this indicates that the model performs well.
KNN: MSE = 0.02644906, MAE = 0.04654853, RMSE = 0.1626317
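For readers unfamiliar with KNN on a continuous target, the following plain Python sketch shows the idea: predict the mean target of the K nearest training points, then score predictions with MSE, MAE and RMSE. The function names and toy data are hypothetical, and Euclidean distance is an assumption; the report does not state which distance was used.

```python
from math import sqrt

def knn_predict(train_X, train_y, query, k=5):
    """Predict by averaging the targets of the k nearest training rows."""
    # Euclidean distance is an assumption of this sketch.
    dists = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
        for x, y in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

def regression_errors(actual, predicted):
    """Return the three error measures used in the report."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"MSE": mse, "MAE": mae, "RMSE": sqrt(mse)}

# Toy data, for illustration only: one feature, target equal to the feature.
train_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
pred = knn_predict(train_X, train_y, [2.0], k=3)
errs = regression_errors([1.0, 2.0], [1.5, 2.0])
```

RMSE is the square root of MSE, and MAE never exceeds RMSE, so the three numbers can be checked against one another when reading model output.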
### 3.0.2 Decision Tree
We also chose the decision tree algorithm for comparison. A decision tree is a graphical way of representing choices and their consequences. It is a powerful supervised learning algorithm that can fit complex datasets, make very fast predictions, identify important variables easily, and deal with missing data. Decision trees also express their results as explicit conditions on the original variables, which makes the outcomes easy to understand. Because the algorithm does not require much computation, the model is easy to program, which is a large part of why we chose it. After building the decision tree model, we used cross-validation to tune it, specifying the complexity parameter and adjusting the tree length and the Gini index used to split branches.
Decision tree: MSE = 0.598091, MAE = 0.2269064, RMSE = 0.7733634
The RMSE, MSE and MAE of the decision tree are 0.6594204, 0.4348353 and 0.2191245. These errors are fairly high, which indicates that the model performs poorly. The figure also shows that the true values and the predicted values overlap only weakly.
Decision tree (tuned): MSE = 0.04783897, MAE = 0.04490255, RMSE = 0.2187212
According to the figure, the overlap between the real values and the predicted values improves significantly after tuning, so the tuned tree is the better model.
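The effect of tuning can be illustrated with a small regression tree whose depth is chosen on held-out data. This is a simplified sketch, not the procedure used in the report. It assumes squared-error splits, which are the usual criterion for a continuous target, rather than the Gini index; it uses a maximum depth as a stand-in for the complexity parameter; and it uses a single hold-out split instead of cross-validation. All names and data are hypothetical.

```python
def leaf_value(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared errors around the mean (the split criterion)."""
    m = leaf_value(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(rows, ys, max_depth, min_leaf=2):
    """Return a nested dict tree, or a float when the node is a leaf."""
    if max_depth == 0 or len(ys) < 2 * min_leaf:
        return leaf_value(ys)
    best = None  # (cost, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, ys) if r[j] <= t]
            right = [y for r, y in zip(rows, ys) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None:
        return leaf_value(ys)
    _, j, t = best
    go_left = [r[j] <= t for r in rows]

    def side(flag):
        return ([r for r, g in zip(rows, go_left) if g == flag],
                [y for y, g in zip(ys, go_left) if g == flag])

    return {"feature": j, "threshold": t,
            "left": build_tree(*side(True), max_depth - 1, min_leaf),
            "right": build_tree(*side(False), max_depth - 1, min_leaf)}

def predict(tree, row):
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

def mse(tree, rows, ys):
    return sum((predict(tree, r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def tune_depth(train_rows, train_ys, valid_rows, valid_ys, depths):
    """Pick the depth with the lowest validation MSE (first one on ties)."""
    scores = {d: mse(build_tree(train_rows, train_ys, d), valid_rows, valid_ys)
              for d in depths}
    return min(scores, key=scores.get), scores

# Toy step-shaped data: even x for training, odd x for validation.
xs = list(range(20))
train_rows = [[x] for x in xs if x % 2 == 0]
train_ys = [x // 5 for x in xs if x % 2 == 0]
valid_rows = [[x] for x in xs if x % 2 == 1]
valid_ys = [x // 5 for x in xs if x % 2 == 1]
best_depth, scores = tune_depth(train_rows, train_ys, valid_rows, valid_ys, (1, 2, 3))
```

On this toy data a depth-1 tree underfits, and going from depth 1 to depth 2 lowers the validation error sharply, which mirrors the improvement reported after tuning.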
In general, KNN performs better than the decision tree. Tuning greatly improves the accuracy of the decision tree, and its predicted values then overlap the real values almost as closely as those of KNN, but KNN still has the lower MSE and RMSE on this dataset, while the MAE of the two models is nearly the same. As YouTubers, we would choose the KNN model to predict the number of views of our videos, and use that experience to adjust our content and attract subscribers.
## 4.0 Data analysis
### 4.0.1 General analysis
In the first part, we analyze the full dataset, using all the video data to calculate the total and average value of each variable by category_id. This gives a better view of how each type of video performs. Four plots were generated; we show a typical one here, the average views of each category, and put the remaining three in the appendix.
The plot shows that ‘Music’ videos have far higher average views than the other categories, more than four times the average views of the others. ‘Film and Animation’ and ‘Science & Technology’ also have higher average views than the rest.
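The per-category averages behind this plot amount to a simple group-by. The sketch below shows the calculation in plain Python; the function name, the record layout and the numbers are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict

def average_by_category(videos, field="views"):
    """Average a numeric field over all videos in each category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for video in videos:
        totals[video["category"]] += video[field]
        counts[video["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Hypothetical records, for illustration only.
videos = [
    {"category": "Music", "views": 4000000},
    {"category": "Music", "views": 2000000},
    {"category": "Entertainment", "views": 600000},
    {"category": "Sports", "views": 300000},
    {"category": "Sports", "views": 500000},
]
avg_views = average_by_category(videos)
```

Passing a different `field`, such as likes, dislikes or comment counts, would give the averages for the other three plots.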
### 4.0.2 Analysis of the top 5,000 videos
Because we collected a large number of observations, we decided to look more closely at the most popular videos, to identify their common characteristics and make it easier for a new YouTuber to prepare. We ranked all videos by views and selected the 5,000 most-viewed videos (the top 5%). We created a pie chart to show the proportion of each category in the top 5,000 list. In terms of views, ‘Music’ is clearly the most popular video type: 70.6% of the top 5,000 videos are music related. ‘Entertainment’ is also a good topic to try, with a proportion of 14.7%.
We also looked at the videos that people dislike most. The top two categories are the same as in the list of most-viewed videos: ‘Music’ accounts for 58.8% of the 5,000 most-disliked videos and ‘Entertainment’ for 18.7%. This is probably because a huge number of videos of these types appear on YouTube, of both good and bad quality. With so many videos in these two categories, the spread in quality is also likely to be large.
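The category shares quoted for the most-viewed and most-disliked lists follow the same recipe: rank by one field, keep the top N, and count categories. A minimal sketch in plain Python, with a hypothetical function name and made-up records:

```python
from collections import Counter

def top_n_category_share(videos, n, key="views"):
    """Percentage of each category among the n videos ranked highest by `key`."""
    top = sorted(videos, key=lambda v: v[key], reverse=True)[:n]
    counts = Counter(v["category"] for v in top)
    return {cat: 100.0 * k / len(top) for cat, k in counts.items()}

# Hypothetical records, for illustration only.
videos = [
    {"category": "Music", "views": 900, "dislikes": 30},
    {"category": "Music", "views": 800, "dislikes": 5},
    {"category": "Entertainment", "views": 700, "dislikes": 40},
    {"category": "Music", "views": 600, "dislikes": 20},
    {"category": "Sports", "views": 100, "dislikes": 1},
]
share_by_views = top_n_category_share(videos, n=4, key="views")
share_by_dislikes = top_n_category_share(videos, n=4, key="dislikes")
```

Comparing the two dictionaries side by side is the same comparison the two pie charts make.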
We hope that the pie charts above give new YouTubers some intuition about which direction to pursue. It is also true that there is less competition in the less popular categories, so it may be easier for a new YouTuber to do well there. The choice of category matters a great deal.
In the last part of our analysis, we studied the titles of the popular videos to find out which words appear most frequently. The result is shown below: among the top 5,000 videos, ‘officials’ and ‘trailer’ are the words that appear most often. New YouTubers may even try to include these words in their video titles in order to attract viewers who respond strongly to specific keywords.
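Counting title words can be sketched as follows. This plain Python example is an illustration under stated assumptions: titles are lower-cased and split on non-alphanumeric characters, and the optional stop-word set is hypothetical. The report does not say how the titles were tokenized.

```python
import re
from collections import Counter

def top_title_words(titles, n=10, stopwords=frozenset()):
    """Return the n most frequent words across all titles as (word, count) pairs."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical titles, for illustration only.
titles = [
    "Official Trailer",
    "Movie Trailer (Official)",
    "Official Music Video",
]
top_words = top_title_words(titles, n=2)
```

Filtering common words such as "the" or "a" through `stopwords` keeps the ranking focused on words that carry meaning.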
The project consists of four sections: data cleaning, regression analysis, KNN and decision tree testing, and data visualization. It gave us the opportunity to study YouTube videos and to look at the state of the YouTube video market from different angles, and the four sections together helped us understand that market better. In exploring the topic of ‘How to become a successful YouTuber’, we gained some new insights into this area. In conclusion, we now understand the relationships between video features and have built several good models for forecasting. We also used plots to reveal current trends in YouTube videos, giving new YouTubers some inspiration about which direction might be a good choice.
# Appendix